Wikileaks leak indexer

This was the first project I ever did. It was during my first year in computing science. I had no prior knowledge of programming besides an introductory course in C++. But when the opportunity arose during a security lecture I just had to grab it. Our security professor was friends with the research journalist Huub Jaspers, who wanted an easy way to search through the Wikileaks documents that had just been leaked at the time. So he asked our professor if he knew some students who might be interested. In the following lecture the professor asked the whole lecture room who was interested, and 5 people raised their hands: Erik Boss, Sjors Gielen, Rik Harink, Nick Overdijk and Dennis Brentjes (me).

The project was pretty time intensive, and I had to learn a lot and be quick on my feet as I was the least knowledgeable member of the group at the time. But in the long run this project was a fun and wonderful experience. The cooperation with the research journalists was refreshing. In a way they are power users of search engines, but they don't necessarily know how to express their power-user needs. This became obvious when we started testing the first versions of the software with a small group of researchers. Some of them compared it to another search engine, LexisNexis, and pointed out missing features. We then implemented some of these features.

The project culminated in the VVOJ Legebeke Legaat 2011, where Huub Jaspers presented this product to a large group of Dutch research journalists. We also hosted a small workshop on site, which unfortunately was scheduled alongside other interesting talks and therefore didn't attract that many people. But the research tool did come up during a discussion panel with some prominent editors of the Dutch press. The discussion focused on how to disclose the information contained in the Wikileaks documents now that this search engine existed. The documents were unredacted and could pose serious threats to the people named in them. Huub Jaspers explained that only journalists who approached him, VPRO or Argos would get access. He had decided this at the beginning of our project. Although other public search engines did exist, it was a matter of principle not to disclose possibly dangerous information. The added capabilities to search for dates and geo-coordinates also contributed to his decision not to make it publicly available.

Looking back at this project, we could have done things differently. Although we did use standard search engine techniques like inverted indexes and smart merging of result vectors, we could have built the search engine on a standard search engine package like Xapian. But as we didn't find Xapian when we started this project, we implemented everything on our own, and this was a good learning experience for all of us. We were able to do full-text search, search for dates and date ranges, and we even tried our hand at geo-coordinates. But the most important thing: the search engine is tailored to the needs of the research journalists.
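To give an idea of what the inverted (reversed) index and the merging of result vectors mentioned above look like, here is a minimal sketch in Python. It is not our original code: the function names `build_index`, `intersect` and `search` and the tiny example documents are made up for illustration, and it only covers plain full-text AND queries, not dates or geo-coordinates.

```python
from collections import defaultdict


def build_index(docs):
    """Map every term to a sorted list of ids of the documents containing it."""
    index = defaultdict(list)
    # Visiting documents in id order keeps every posting list sorted.
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return index


def intersect(a, b):
    """Merge two sorted posting lists, keeping ids that occur in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Start with the shortest posting list so intermediate results stay small.
    postings = sorted((index.get(t, []) for t in terms), key=len)
    result = list(postings[0])
    for other in postings[1:]:
        result = intersect(result, other)
        if not result:
            break
    return result


if __name__ == "__main__":
    # Hypothetical documents, only for illustration.
    docs = {
        1: "cable from kabul embassy",
        2: "war log kabul patrol",
        3: "embassy cable on trade",
    }
    index = build_index(docs)
    print(search(index, "embassy cable"))
```

Because the posting lists are sorted, the merge walks each list once, and starting with the rarest term keeps the intermediate results short. A package like Xapian provides this kind of machinery ready-made.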

The downside of our 'roll your own' search engine was its scalability: after indexing the Afghan and Iraqi war logs and all the released cables, we had used up all of our 16 GB of RAM. We had little to no idea how to reduce the RAM usage without dramatically impacting performance. Nowadays we have some ideas on how to do this, mostly thanks to the experience in software construction and architecture that we have gained since, for which this project was a great kick-starter.

The logo was created by Erik Boss.